Creating a Corpus of Auslan within an Australian National Corpus
Abstract
The creation of signed language (SL) corpora presents special challenges to linguists. SLs are face-to-face, visual-gestural languages that have no widely accepted written forms or standard specialist notation system, making even superficial transcription problematic. SL corpora need to be created taking these facts into account. Using the example of Auslan (Australian Sign Language), this paper describes how multimedia annotation software can now be used to transform a language recording into a machine-readable text without it first being transcribed, provided that conventional linguistic units are systematically and consistently identified, thus making possible the creation of a true linguistic corpus of an SL. Before examining SL annotation in detail, we first review the main features of modern linguistic corpora and introduce the Auslan archive, which is the source of the future Auslan corpus. The paper concludes with an assessment of the place of an Auslan corpus within an Australian National Corpus and an evaluation of other recent SL corpus projects elsewhere in the world.
A modern linguistic corpus is something more than just a dataset of written or transcribed texts upon which a description or an analysis of a language is based. That older sense of corpus has now essentially been superseded in the literature (e.g., McEnery & Wilson, 2001; Sampson & McCarthy, 2004; Hoey, Mahlberg, Stubbs, & Teubert, 2007). A corpus in the modern sense means a collection of written and spoken texts in a machine-readable form that has been assembled for the purposes of studying the type and frequency of constructions in a language. A modern linguistic corpus contains linguistic annotations and appended sociolinguistic and sessional data (metadata) that describe the participants and the circumstances under which the data were collected.
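As a rough illustration of what "machine-readable" can mean here, the following minimal Python sketch stores time-aligned annotations together with session metadata and counts how often each label occurs. The class names, field names, glosses, and file name are all hypothetical and invented for this example; they are not the annotation scheme or data of the Auslan corpus.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """One time-aligned unit identified in a recording (hypothetical schema)."""
    start_ms: int
    end_ms: int
    gloss: str  # consistent label used every time the same unit is identified


@dataclass
class Session:
    """A recording plus its metadata and annotations (hypothetical schema)."""
    media_file: str
    metadata: dict
    annotations: list = field(default_factory=list)


def gloss_frequencies(sessions):
    """Count how often each gloss occurs across all sessions."""
    counts = Counter()
    for session in sessions:
        counts.update(a.gloss for a in session.annotations)
    return counts


# Invented example data; not taken from the Auslan archive.
demo = [
    Session(
        media_file="session01.mp4",
        metadata={"participant": "P01", "region": "unspecified"},
        annotations=[
            Annotation(0, 400, "HOUSE"),
            Annotation(450, 900, "GO"),
            Annotation(950, 1300, "HOUSE"),
        ],
    ),
]

frequencies = gloss_frequencies(demo)
```

The counts are only meaningful if the same unit always receives the same label, which is why consistent identification of units matters more for this purpose than a detailed transcription.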
With the development of digitized video recording and multimedia annotation software, a corpus of a signed language (henceforth, SL) can now be described as a subtype of ‘spoken’ language corpora, namely face-to-face language. SL corpora promise to vastly improve peer review of descriptions of SLs and make possible, for the first time, a corpus-based approach to SL analysis. Corpora are important for the testing of language hypotheses in all language research at all levels, from phonology, through lexis, morphology, syntax, and pragmatics to discourse. There are several reasons why testing is particularly relevant in the field of SL studies. First, SLs, which are invariably young languages of minority communities, lack written forms and the well-developed community-based standards of correctness that often accompany literacy. Second, they have interrupted generational transmission and few native speakers. Third, the representation of SL examples using written glosses has meant that primary data have remained essentially inaccessible to other researchers and consequently unavailable for meaningful peer review. Thus, although introspection and observation can still be of valuable assistance to linguists developing hypotheses regarding SL use and structure, one must also recognize that intuitions and researcher observations may fail in the absence of clear native signer consensus on phonological or grammatical typicality, markedness, or acceptability. The previous reliance on the intuitions of small numbers of informants has thus been problematic in the field. As with all modern linguistic corpora, SL corpora should be representative, well-documented, and machine-readable (McEnery & Wilson, 2001; Teubert & Cermáková, 2007). This not only requires dedicated technology and standards (e.g., Crasborn et al., 2007), it also requires a principled methodology for transcription or annotation.
The guiding principle behind the linguistic annotations being created in the initial stages of an Auslan corpus is machine-readability, not transcription narrowly understood. The aim is to create an
Related resources
Ingesting the Auslan Corpus into the DADA Annotation Store
The DADA system is being developed to support collaborative access to and annotation of language resources over the web. DADA implements an abstract model of annotation suitable for storing many kinds of data from a wide range of language resources. This paper describes the process of ingesting data from a corpus of Australian Sign Language (Auslan) into the DADA system. We describe the format ...
Developing a Quality Spoken Component of the Australian National Corpus
The creation of a quality spoken component of the Australian National Corpus (AusNC) will allow us to deepen our understandings of Australian English (AusE) and to open up new areas of analysis. To make the most of this opportunity we contend that not only must the data be of high quality but that the corpus must also be constructed in such a way that the data is of maximal use to researchers w...
The Australian National Corpus: National Infrastructure for Language Resources
The Australian National Corpus has been established in an effort to make currently scattered and relatively inaccessible data available to researchers through an online portal. In contrast to other national corpora, it is conceptualised as a linked collection of many existing and future language resources representing language use in Australia, unified through common technical standards. This a...
Gearing the Discursive Practice to the Evolution of Discipline: Diachronic Corpus Analysis of Stance Markers in Research Articles’ Methodology Section
Despite widespread interest and research among applied linguists to explore metadiscourse use, very little is known of how metadiscourse resources have evolved over time in response to the historically developing practices of academic communities. Motivated by such an ambition, the current research drew on a corpus of 874315 words taken from three leading journals of applied linguistics in orde...
Towards the Design of the Australian National Corpus
Corpora are becoming more and more important as a research tool for linguists as they are large collections of authentic text. However, not every researcher has the time and resources to compile their own corpus. Large corpora in the world such as the BNC, the ANC or the International Corpus of English (ICE) have been widely used for research on the English language in general or an English dia...